arXiv:1509.02409v1 [cs.LG] 8 Sep 2015


Data-selective Transfer Learning for Multi-Domain Speech Recognition 

Mortaza Doulaty, Oscar Saz, Thomas Hain 
Speech and Hearing Group, University of Sheffield, Sheffield, UK 

{mortaza.doulaty, o.saztorralba, t.hain}@Sheffield.ac.uk 


Abstract 

Negative transfer in the training of acoustic models for automatic
speech recognition has been reported in several contexts, such as
domain change or speaker characteristics. This paper proposes a novel
technique to overcome negative transfer by efficient selection of
speech data for acoustic model training. Here data is chosen based on
its relevance to a specific target. A submodular function based on
likelihood ratios is used to determine how acoustically similar each
training utterance is to a target test set. The approach is evaluated
on a wide-domain data set, covering speech from radio and TV
broadcasts, telephone conversations, meetings, lectures and read
speech. Experiments demonstrate that the proposed technique both finds
relevant data and limits negative transfer. Results on a 6-hour test
set show a relative improvement of 4% with data selection over using
all data in PLP-based models, and 2% with DNN features.

Index Terms: data selection, transfer learning, negative transfer,
speech recognition

1. Introduction 

As Automatic Speech Recognition (ASR) systems improve their accuracy,
new applications and domains become the target of research. Automatic
transcription of speech of unknown origin is a challenging task, which
is related to access to so-called “found data”, such as media and
historical audio archives. For this to be feasible, ASR has to produce
accurate output whatever conditions are present in the target data
(e.g. interviews, distant recordings, telephone conversations, etc.).
Training acoustic models for an unknown domain, e.g. YouTube
recordings, can be infeasible if the origin of the target speech
cannot be properly assessed, and the loss of accuracy can be large due
to wrong modelling decisions. Another option is to train an acoustic
model on a large amount of data from multiple domains, although this
is not guaranteed to give optimal results.

Maximum Likelihood Estimation (MLE) of the Gaussian Mixture Model
(GMM) parameters of a Hidden Markov Model (HMM) is still a standard
approach to train acoustic models in ASR, either with
perceptually-based features like Perceptual Linear Prediction (PLP)
features, or with Deep Neural Network (DNN) based features [2] in a
tandem configuration. However, MLE has two well-known requirements:
first, model correctness is assumed; and second, the amount of
training data is required to be infinite. Neither requirement holds in
standard situations in ASR, although systems are sometimes trained
with many years of speech data. Moreover, adding more data does not
guarantee that the performance of the system will improve, and even if
it does, the gains become smaller and smaller [5]. A further effect,
negative transfer, has been observed in several cases: knowledge
acquired for one task can have a negative effect on performance in
another task. As a result, being able to select informative training
data remains an important task.

This paper studies positive and negative transfer in ASR in a
multi-domain scenario. The work proposes to use submodular functions
based on the acoustic similarity between the target test set and the
training data, so that positive transfer is exploited to improve
performance across domains while the impact of negative transfer is
reduced at the same time. Submodular functions have been used
successfully before to select data in semi-supervised training and
active learning for ASR tasks. However, here we show that these can
also be used to select acoustically matching data in an unsupervised
manner.

This paper is structured as follows: Section 2 provides a review of
data selection techniques for ASR, and Section 3 introduces the
proposed approach for data selection. Section 4 describes the
experimental setup, followed by results and analysis in Section 5. The
final Section 6 summarises and concludes the paper.

2. Data selection for ASR 

Data selection for ASR has mostly been studied for minimal
representative data selection. Here the objective is, given a large
pool of training data, to find a subset of data such that a model set
trained with that data will achieve comparable performance to a model
set trained with all the data. This line of work is related to active
learning, where the aim is to select a subset for manual transcription
with the least budget, and to unsupervised and semi-supervised
learning techniques, where the overall objective is to select a subset
of the training set with the most reliable available transcripts
[17][18][19].

Two techniques are typically used for selecting data: uncertainty
sampling [20], where the scores from an existing model are used to
choose or reject data; and query by committee, where votes of
distinctly trained models are used [7]. For uncertainty sampling two
types of scores have been explored. Confidence scores are used to
select data with the most reliable transcriptions, as in
semi-supervised training [17][4], or to select data for manual
transcription in active learning. Entropy-based methods aim to pick
data that, for instance, fits a uniform distribution of target units
(phonemes, words, etc.), resulting in maximum entropy, or that has a
similar distribution to a target set.

The use of submodular functions has been proposed to tackle the effect
of diminishing returns when adding more data to a training set. A
submodular function is defined as any function
$f : 2^{\Omega} \rightarrow \mathbb{R}$ that fulfils

$$f(S) + f(T) \geq f(S \cup T) + f(S \cap T), \quad \forall S, T \subseteq \Omega \qquad (1)$$

With submodular functions the problem of data selection turns into a
submodular maximisation problem, where the objective is to find a
subset $S$ from the complete training set $\Omega$ so that any new
subset $T$ added to $S$ will not increase the value of the submodular
function $f$:

$$\operatorname*{argmax}_{S \subseteq \Omega} \{ f(S) \mid f(S \cup T) \leq f(S),\ T \subseteq \Omega \setminus S \} \qquad (2)$$

Finding $S$ is an NP-hard problem [22], and greedy solutions have been
proposed in which the subset $S$ is grown iteratively by the item
$s \in \Omega$ that maximises the value of $f$ when added to $S$, as
in Equation 3:

$$\hat{s} = \operatorname*{argmax}_{s \in \Omega \setminus S} \{ f(S \cup \{s\}) \} \qquad (3)$$

The set $S$ is obtained when either the optimal $S$ is found
($f(S) \geq f(S \cup \{s\})$), or a budget $N$ is reached (so that
$|S| \leq N$).
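The greedy procedure of Equation 3 and its two stopping conditions can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the function name `greedy_select`, the toy coverage objective and the utterance labels are all hypothetical, and the budget is taken as a simple item count.

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_select(pool: Sequence[str],
                  f: Callable[[FrozenSet[str]], float],
                  budget: int) -> List[str]:
    """Greedily grow S with the item that maximises f(S ∪ {s})."""
    selected: List[str] = []
    current_value = f(frozenset())
    while len(selected) < budget:
        candidates = [s for s in pool if s not in selected]
        if not candidates:
            break
        chosen = frozenset(selected)
        # Equation 3: pick the item whose addition gives the largest f value.
        # max() returns the first maximal item, so ties follow the pool order.
        best = max(candidates, key=lambda s: f(chosen | {s}))
        best_value = f(chosen | {best})
        # Stop once no remaining item increases f, i.e. f(S) >= f(S ∪ {s}).
        if best_value <= current_value:
            break
        selected.append(best)
        current_value = best_value
    return selected


# Toy objective (an assumption for illustration): each utterance "covers"
# a set of labels, and f(S) counts the distinct labels covered by S.
COVERS = {"u1": {"a", "b", "c"}, "u2": {"c", "d"}, "u3": {"a", "b"}, "u4": {"e"}}


def coverage(subset: FrozenSet[str]) -> float:
    covered = set()
    for utt in subset:
        covered |= COVERS[utt]
    return float(len(covered))


print(greedy_select(list(COVERS), coverage, budget=3))  # ['u1', 'u2', 'u4']
```

With a larger budget the loop still stops after three items, because adding `u3` no longer increases the coverage value.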

If the function $f$ is a normalised monotone submodular function, then
the simple greedy algorithm provides a good approximation of the
optimal solution [23][22][7].

Several functions $f$ can be found in the literature to perform data
selection for ASR tasks, including facility location functions,
saturated coverage functions [24][8], diversity reward functions [3]
or graph cut functions.


3. Likelihood ratio data selection 

To decide whether a training utterance resembles the target data, one
can opt for a classification approach that identifies an item as
suitable or not. Here we make use of the Likelihood Ratio (LR) between
a GMM trained on the target data ($\theta_{tgt}$), and a GMM trained
on the complete training set ($\theta_{\Omega}$). The total LR of an
utterance in the training set, $LR(O)$, $O \in \Omega$, of length $T$
frames is defined as the geometric mean of the frame-based LR values
of the target data model $\theta_{tgt}$ and the background model
$\theta_{\Omega}$, assuming frame independence.




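$$LR(O) = \left( \prod_{t=1}^{T} \frac{p(o_t \mid \theta_{tgt})}{p(o_t \mid \theta_{\Omega})} \right)^{1/T} \qquad (4)$$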
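In practice a geometric mean of frame-level ratios is usually computed in the log domain to avoid numerical underflow. The sketch below assumes that per-frame log-likelihoods under the target and background GMMs are already available; the function name `utterance_lr` and the example values are hypothetical.

```python
import math
from typing import Sequence


def utterance_lr(loglik_tgt: Sequence[float], loglik_bg: Sequence[float]) -> float:
    """Geometric mean of frame-level likelihood ratios for one utterance.

    loglik_tgt[t] is log p(o_t | theta_tgt) and loglik_bg[t] is
    log p(o_t | theta_Omega) for frame t.
    """
    if len(loglik_tgt) != len(loglik_bg) or not loglik_tgt:
        raise ValueError("need two non-empty sequences of equal length")
    n_frames = len(loglik_tgt)
    # The mean of the log ratios is the log of the geometric mean.
    mean_log_ratio = sum(t - b for t, b in zip(loglik_tgt, loglik_bg)) / n_frames
    return math.exp(mean_log_ratio)


# Three frames with likelihood ratios 2.0, 0.5 and 4.0.
tgt = [math.log(0.2), math.log(0.05), math.log(0.4)]
bg = [math.log(0.1)] * 3
lr = utterance_lr(tgt, bg)
print(round(lr, 4))  # 1.5874, the cube root of 2.0 * 0.5 * 4.0
```

A value above 1 indicates that the target model explains the utterance better, on average per frame, than the background model.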

One can define a modular function [11] based on the accumulated LRs of
all utterances included in a subset $S \subseteq \Omega$ in the
following form:

$$f_{LR}(S) = \sum_{O \in S} LR(O) \qquad (5)$$


Modular functions are a special case of submodular functions [22]
where the greater than or equal sign in Equation 1 changes to the
equal sign. This way, the proposed function $f_{LR}$ is submodular as
well. Moreover, since all LR values are non-negative, adding
utterances to a set can never decrease the sum computed by $f_{LR}$,
so the function is necessarily monotonic with expanding sets
($A \subseteq C \Rightarrow f(A) \leq f(C)$).

If a submodular function is non-decreasing and normalised
($f(\emptyset) = 0$), then the greedy solution obtained by Equation 3
is guaranteed to reach at least a constant fraction $(1 - 1/e)$ of the
optimal value. Thus the subset $S$ (the greedy solution) can be used
as the training set. The stopping criterion for adding more data to
this subset $S$ is based on a “budget”, in the form of a maximum
number of hours of speech to be used.
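Because $f_{LR}$ is modular, the gain of adding an utterance does not depend on what has already been selected, so the greedy step reduces to ranking utterances by their LR. A minimal sketch of budgeted selection under that observation is given below. The `Utterance` record, the function name and the choice to stop at the first utterance that would exceed the budget are assumptions made for illustration.

```python
from typing import List, NamedTuple, Sequence


class Utterance(NamedTuple):
    utt_id: str
    lr: float          # utterance-level likelihood ratio LR(O)
    duration_s: float  # duration in seconds


def select_by_budget(utts: Sequence[Utterance], budget_hours: float) -> List[Utterance]:
    """Take utterances in decreasing LR order until the hour budget is used."""
    budget_s = budget_hours * 3600.0
    ranked = sorted(utts, key=lambda u: u.lr, reverse=True)
    selected: List[Utterance] = []
    used_s = 0.0
    for utt in ranked:
        # Assumption: stop at the first utterance that would exceed the budget.
        if used_s + utt.duration_s > budget_s:
            break
        selected.append(utt)
        used_s += utt.duration_s
    return selected


POOL = [
    Utterance("a", 0.8, 1800.0),
    Utterance("b", 2.5, 1800.0),
    Utterance("c", 1.2, 1800.0),
    Utterance("d", 3.0, 1800.0),
]
chosen = select_by_budget(POOL, budget_hours=1.0)
print([u.utt_id for u in chosen])  # ['d', 'b']
print(sum(u.lr for u in chosen))   # 5.5, the value of f_LR for this subset
```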


4. Experimental setup 

To evaluate the proposed approach in a multi-domain ASR task, 
a data set combining 6 different types of data was chosen from 
the following sources: 


• Radio (RD): BBC Radio4 broadcasts from February 2009.

• Television (TV): Broadcasts from the BBC from May 2008.

• Telephone speech (CT): From the Fisher corpus* [25].

• Meetings (MT): From the AMI and ICSI corpora.

• Lectures (TK): From TedTalks.

• Read speech (RS): From the WSJCAM0 corpus.


A subset of 10h from each domain was selected to form the training set
(60h in total), and 1h from each domain was used for testing (6h in
total). The selection of the domains aims to cover the most common and
distinctive types of audio recordings used in ASR tasks.

Two types of acoustic features were used: first, 13 PLP features plus
first and second derivatives for a total of 39-dimensional feature
vectors; and second, a 65-dimensional feature vector concatenating the
39 PLP features and 26 bottleneck (BN) features extracted from a
4-hidden-layer DNN trained on the full 60 hours of data. 31 adjacent
frames (15 frames to the left and 15 frames to the right) of
23-dimensional log Mel filter bank features were concatenated to form
a 713-dimensional super vector; a Discrete Cosine Transform (DCT) was
applied to this super vector to de-correlate and compress it to 368
dimensions, and the result was fed into the neural network. The
network was trained on 4,000 triphone state targets and the
26-dimensional bottleneck layer was placed before the output layer.
The objective function used for training was frame-level cross-entropy
and the optimisation was performed with stochastic gradient descent
using the backpropagation algorithm. DNN training was performed with
the TNet toolkit [30]. For both types of features, MLE-based GMM-HMM
models were trained using HTK with 5-state cross-word triphones and 16
Gaussians per state. The language model was based on a 50,000-word
vocabulary and was trained by combination of component language models
for each of the 6 domains. The interpolation weights were tuned using
an independent development set.
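The input dimensions above can be checked with a small sketch: 31 frames of 23 filter bank values give 31 × 23 = 713 inputs. The text states only the input and output sizes of the DCT step, so the configuration below is an assumption chosen to match them: the DCT is applied along the time axis of each of the 23 bands and the first 16 coefficients are kept (23 × 16 = 368). Edge frames are repeated at utterance boundaries, which is also an assumption.

```python
import math
from typing import List

CONTEXT = 15   # frames on each side of the centre frame
N_BANDS = 23   # log Mel filter bank channels
N_DCT = 16     # coefficients kept per band (assumed, gives 23 * 16 = 368)


def splice(frames: List[List[float]], centre: int) -> List[List[float]]:
    """Return the 2 * CONTEXT + 1 frames around `centre`, repeating edge frames."""
    last = len(frames) - 1
    return [frames[min(max(centre + k, 0), last)]
            for k in range(-CONTEXT, CONTEXT + 1)]


def dct_ii(x: List[float], n_coeffs: int) -> List[float]:
    """First n_coeffs coefficients of an unnormalised DCT-II of x."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n_coeffs)]


def dnn_input(frames: List[List[float]], centre: int) -> List[float]:
    """Build the compressed DNN input vector for one centre frame."""
    window = splice(frames, centre)  # 31 frames of 23 values: 713 numbers
    features: List[float] = []
    for band in range(N_BANDS):
        trajectory = [frame[band] for frame in window]  # 31 values over time
        features.extend(dct_ii(trajectory, N_DCT))
    return features


# Synthetic filter bank frames, used only to check the dimensions.
frames = [[float(i + b) for b in range(N_BANDS)] for i in range(40)]
window = splice(frames, 0)
print(sum(len(f) for f in window))  # 713
print(len(dnn_input(frames, 20)))   # 368
```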

4.1. Baseline results 

Table 1 presents results using both types of acoustic features. These
results show the large variation in performance among domains: with
PLP features, WER ranges from 17-18% for read speech and radio
broadcasts to 51% for television broadcasts. The use of DNN front-ends
provides a 25% relative improvement in performance over PLP features,
which is consistent across domains and follows results previously seen
in the literature [33].


Table 1: WER (%) of models trained on the full set

Features  RD    TV    CT    MT    TK    RS    Total
PLP       18.4  51.1  46.6  44.0  34.1  17.3  36.0
PLP+BN    13.3  42.0  33.5  32.2  23.5  13.0  26.8
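As a check on the quoted 25% figure, the relative improvement can be worked out from the Total column of Table 1. The per-domain values are computed in the same way from the corresponding columns.

```latex
\[
\frac{\mathrm{WER}_{\mathrm{PLP}} - \mathrm{WER}_{\mathrm{PLP+BN}}}{\mathrm{WER}_{\mathrm{PLP}}}
= \frac{36.0 - 26.8}{36.0} = \frac{9.2}{36.0} \approx 0.256
\]
\[
\text{TV:}\quad \frac{51.1 - 42.0}{51.1} \approx 0.178
\qquad
\text{TK:}\quad \frac{34.1 - 23.5}{34.1} \approx 0.311
\]
```

The overall figure is therefore about 25.6%, with per-domain values ranging from roughly 18% (TV) to roughly 31% (TK).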


5. Results 

An initial set of experiments was conducted to identify and measure
negative transfer in ASR tasks, and an evaluation of the proposed data
selection technique was performed.


*All of the telephone speech data was up-sampled to 16 kHz to
match the sampling rate of the rest of the data.














Figure 1: Relative WER improvement by adding cross-domain
data to in-domain models (heat map; rows: test domain, columns:
training domain added; colour scale from -17.5% to 7.5%)

Figure 2: WER improvement with budget-based data selection


5.1. Evaluation of negative transfer 

Six different domain-dependent MLE models were trained from the 10
hours of training data for each domain (in all of the experiments PLP
features were used, unless stated otherwise). Each of these models was
then used to decode the complete test set. The results in Table 2 show
that in-domain results (when the train and test data match based on
manually labelled domains) are not greatly different from those
obtained with a model trained on the full 60-hour training set. In
contrast, cross-domain scores (train and test are mismatched) result
in considerable performance decreases everywhere.

Figure 3: Types of data selected for a 10-hour budget


Table 2: WER (%) with domain-specific acoustic models using PLP
features (rows: training domain; columns: test domain)

Domain  RD    TV    CT    MT    TK    RS    Total
RD      19.1  55.1  72.1  57.2  50.7  24.9  47.8
TV      26.5  52.9  77.3  63.8  52.1  35.2  52.5
CT      82.3  90.1  44.4  71.9  67.9  86.6  72.6
MT      44.9  72.3  69.2  44.0  51.1  41.1  54.7
TK      39.8  62.8  69.3  56.1  35.1  55.4  53.6
RS      29.9  66.2  84.1  67.2  68.9  16.9  57.4

A second set of experiments was performed with models trained on 20
hours of data, using data from every possible pair of domains, for a
total of 30 new acoustic models. Figure 1 shows the results in terms
of relative improvement and degradation over the results of the
10-hour in-domain models. The rows of Figure 1 represent the testing
domain and the columns represent the domain that was added in training
to the data of the domain of the row. Positive values (blue squares)
mark the existence of positive transfer, such as adding TV data to
Radio data (7% improvement) or adding Radio data to Lecture data (4%
improvement). Negative values (red squares) mark negative transfer,
like adding Telephone data to Read speech (16% loss) or adding Read
speech to Lecture data (5% loss).

These results showed that positive and negative transfer occurred
across domains, possibly due to similarities and differences in speech
styles, acoustic channels and background conditions. However, a
rule-based optimisation of the best model for each target domain would
require a complex and error-prone process. The next experiments aimed
to evaluate how an automatic selection of training data could exploit
positive transfer, while restricting negative transfer.

5.2. Data selection based on budget 

The data selection technique proposed in Section 3 was evaluated next.
For each of the six target test domains, a Gaussian Mixture Model
(GMM) with 512 mixtures was trained ($\theta_{tgt_{1..6}}$), and a
background 512-mixture GMM ($\theta_{\Omega}$) was trained from the
complete training set of 60 hours. These GMMs were used to calculate
the LR value for each training utterance ($LR(O)$) in order to select
the training data according to the acoustic similarity.

The first evaluation was performed using data selection based on
budget. Five possible budgets of 10, 20, 30, 40 and 50 hours were
defined for each test domain and the respective training data was
chosen using the $f_{LR}(S)$ submodular function. Figure 2 shows the
relative improvement for each domain and budget against the results
with the 60-hour model. The graphs show that all domains improve
performance as the budget increases until a certain limit is reached;
beyond that point negative transfer decreases the performance, which
converges to the WER achieved with the 60-hour trained model.

In order to observe which types of data were selected for each domain
with the different budgets, Figure 3 presents the percentage of
training data selected for each test domain with a 10-hour budget.
While the majority of the data was chosen from the same domain, some
cross-domain data was also selected, indicating positive transfer
between domains. This occurred, for instance, with TV and Read speech
data towards Radio data; and Lecture data towards TV data.

5.3. Automatic decision on budget 

An issue with the budget-based proposal is that a decision on a budget
has to be made, and as the results in Figure 2 suggest, the optimal
budget varies across different domains. A method for deciding a budget
for a given target domain was proposed by selecting only utterances
whose likelihood ratio is above a threshold, defined as the mean of
the highest-weighted mixture of a GMM fitted to the distribution of
likelihood ratios. The use of the mixture with the highest weight
avoids the influence of outliers in the distribution of the LR values.
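The thresholding rule can be sketched with a small one-dimensional GMM fitted by EM. The text does not give the number of mixture components or the fitting procedure, so the two components, the evenly spaced initial means, the iteration count and the variance floor below are all assumptions, as are the function names and the toy LR values.

```python
import math
from typing import Dict, List, Sequence, Tuple


def fit_gmm_1d(values: Sequence[float], n_components: int = 2,
               n_iter: int = 50) -> List[Tuple[float, float, float]]:
    """Fit a 1-D GMM with EM; returns (weight, mean, variance) per component."""
    xs = list(values)
    n = len(xs)
    lo, hi = min(xs), max(xs)
    # Deterministic initialisation: means evenly spaced over the data range.
    means = [lo + (k + 0.5) * (hi - lo) / n_components for k in range(n_components)]
    mu = sum(xs) / n
    var0 = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    variances = [var0] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in xs:
            dens = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
    return list(zip(weights, means, variances))


def auto_threshold(lr_values: Sequence[float]) -> float:
    """Mean of the highest-weighted component of the fitted GMM."""
    components = fit_gmm_1d(lr_values)
    return max(components, key=lambda c: c[0])[1]


# Toy LR values: a bulk around 1.0 plus two outliers near 5.
lr_by_utt: Dict[str, float] = {
    "u0": 0.6, "u1": 0.7, "u2": 0.8, "u3": 0.9, "u4": 1.1,
    "u5": 1.2, "u6": 1.3, "u7": 1.4, "u8": 5.0, "u9": 5.2,
}
threshold = auto_threshold(list(lr_by_utt.values()))
selected = [u for u, lr in lr_by_utt.items() if lr > threshold]
print(selected)
```

In this toy example the plain mean of all values is 1.82, pulled upwards by the two outliers, whereas the mean of the dominant component stays close to 1.0, which illustrates why the highest-weighted mixture is the more robust choice.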

The experiments with an automatic budget decision were performed for
both types of features, PLP and PLP+BN. Table 3 presents the results
for these experiments and compares them to the outcome of data
selection based on a 30-hour budget, which was the best fixed budget
from Figure 2. The results showed that the use of an automatically
derived threshold improved on the results obtained with a fixed budget
for both types of features (34.7% against 34.9% total WER with PLP,
and 26.2% against 26.3% with PLP+BN), indicating that the proposed
method could estimate the right amount of data to select for each
target domain.


Table 3: WER (%) using data selection

Method           RD    TV    CT    MT    TK    RS    Total
PLP features
Budget-30h       17.7  50.0  44.2  43.4  33.4  15.5  34.9
Auto. Decision   17.7  49.7  44.2  43.8  32.9  15.1  34.7
PLP+BN features
Budget-30h       13.0  41.5  32.6  32.1  22.5  12.1  26.3
Auto. Decision   12.7  41.4  32.5  32.3  22.4  11.8  26.2
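The relative gains of the automatic decision over the full 60-hour models can be worked out from the Total columns of Table 1 and Table 3:

```latex
\[
\text{PLP:}\quad \frac{36.0 - 34.7}{36.0} = \frac{1.3}{36.0} \approx 0.036
\qquad
\text{PLP+BN:}\quad \frac{26.8 - 26.2}{26.8} = \frac{0.6}{26.8} \approx 0.022
\]
```

These correspond to the rounded figures of 4% and 2% quoted for the two feature types.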


The amount of data selected for each domain is presented in Table 4.
This table shows that Read speech and Conversational Telephone speech
are the domains which benefited from a lower amount of training data
(about 22 hours or less), while the rest of the domains preferred more
data (from about 31 to 41 hours). These values were consistent with
the patterns of positive and negative transfer observed in Figure 1.

Table 4: Hours of data selected by automatic budget decision

Domain  RD    TV    CT    MT    TK    RS
Hours   41.2  35.8  21.9  35.6  31.4  17.1


6. Conclusion 

In this paper, the effect of positive and negative transfer across
widely diverse domains in ASR was explored. We confirmed that the use
of more data in MLE-based acoustic models does not always provide
increases in performance. A submodular function based on the
Likelihood Ratio was proposed and used to perform an informed and
efficient selection of data for different target test sets. The
evaluation of selection techniques based on budget and on automatic
budget decision achieved relative gains of 4% over a 60-hour MLE model
for PLP features and 2% for PLP+BN features.


Previous works have shown that data selection techniques can result in
data sets biased towards specific groups of phones or triphones. A
phonetic analysis of the data sets given by the likelihood ratio
function used in this paper did not show any bias on phones in these
data sets. The 60-hour training data used in this work was well
balanced phonetically, which limited the risk of phonetic biases in
the selected data. In situations where the original training data
might present less well distributed phonetic content, the proposed
function should be complemented by a function that takes into account
the resulting phone distribution of the data.

Future work should explore similar data selection techniques for other
training criteria besides MLE. The presented methods are based on LR
and hence well-suited for MLE, but other submodular functions will be
required to cater for needs given by discriminative objective
functions such as Minimum Phone Error training. Further work should
also investigate data selection techniques for datasets larger than
the one studied here, in completely mismatched conditions, and using
different features that better describe the background’s acoustic
characteristics.

The technique presented in this paper can be used for building
targeted models for “found speech data”. The ability to use very
diverse data sets to transcribe newly found sets of speech recorded in
unknown conditions is especially necessary to deal with this type of
data. Other tasks, such as the automatic transcription of multi-genre
media archives, might also benefit from the advances achieved in this
work.